
This blog post was updated in June 2022 to enhance the entity relationship diagram.

Developing an efficient data model for an enterprise data warehouse (EDW) typically requires considerable design, development, management, and operational resources. Additionally, the model must remain flexible and capable of adapting to changes while efficiently processing large volumes of data.

The Data Vault methodology provides a framework for project design and implementation that accelerates the development of data warehouse initiatives. Within this methodology, the Data Vault 2.0 modeling standards are widely adopted in the industry due to their focus on business keys and the relationships they maintain concerning business processes. Data Vault enables the swift creation of data models through several key features:

  • Pattern-based entities, each with a clear purpose
  • Elimination of data silos via source system-independent structures
  • Parallel data loading with minimal dependencies
  • Storage of historized data at its most granular level
  • Application of flexible business rules independent of data loading
  • Seamless integration of new data sources without impacting the existing model

It is always advisable to reverse-engineer from business requirements to select the most appropriate modeling pattern. Occasionally, Data Vault may not be the optimal choice for your EDW; in such cases, another modeling approach may be more effective.

In this article, we illustrate how to implement a Data Vault model within Amazon Redshift and how to query it effectively by leveraging the latest features, including the separation of compute from storage, seamless data sharing, automatic table optimizations, and materialized views. A related series covers best practices for designing enterprise-grade Data Vaults of various scales using Amazon Redshift.

Overview of Data Vault Data Modeling

A data warehouse constructed using Data Vault typically follows a four-layer architecture:

  1. Staging Layer: This layer contains the latest changes to data from source systems without retaining history. Various transformations, such as data type adjustments or character set conversions, may be applied during this stage.
  2. Raw Data Vault Layer: This layer stores a historized copy of data from multiple source systems. At this point, no filters or business transformations are applied, aside from organizing the data into source-system-independent targets.
  3. Business Data Vault Layer: While optional, this layer is often established to include business calculations and de-normalizations to enhance access speed and simplicity within the Information Mart layer.
  4. Information Mart Layer: This is where data is primarily accessed by users, such as for reporting dashboards or extracts. Multiple marts can be created from the Data Vault Integration Layer, typically utilizing Star/Kimball schemas for modeling.
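On Amazon Redshift, these four layers are commonly mapped to separate schemas so that permissions and loading pipelines can be managed per layer. The schema names below are illustrative, not taken from the article:

```sql
-- Illustrative schema-per-layer layout; the names are assumptions
create schema staging;         -- latest source changes, no history kept
create schema raw_data_vault;  -- historized, source-independent hubs, links, satellites
create schema business_vault;  -- optional business rules and de-normalizations
create schema info_mart;       -- star-schema marts for end-user access
```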

Transforming a Third Normal Form Transactional Schema into a Data Vault Schema

The following entity relationship diagram represents a standard transactional model for a service selling sports tickets. The primary entities include sporting events, customers, and tickets. A customer can buy one or more tickets for a sporting event, captured by the Ticket Purchase History intersection entity. Each sporting event has multiple tickets available for purchase, and each event is hosted in a particular city.

To convert this source model into a Data Vault model, we begin by identifying business keys, their descriptive attributes, and the associated business transactions. The three core entity types in the Raw Data Vault model are:

  • Hubs: A collection of business keys identified for each business entity.
  • Links: Business transactions within the modeled process, recorded at a specific point in time and between two or more business keys (hubs).
  • Satellites: Historized reference data about either the business key (Hub) or business transaction (Link).

The next example illustrates some sporting event entities converted into the aforementioned Raw Data Vault components.

Hub Entities

The hub serves as the definitive list of business keys loaded into the Raw Data Vault layer from all source systems. Each business key uniquely identifies a business entity and is never duplicated. In our example, the source system assigns a surrogate key field labeled Id to represent the Business Key, stored in the Hub as sport_event_id. Common additional columns in hubs include Load DateTimeStamp—marking when the business key was first discovered—and Record Source, which denotes the source system where this key was initially loaded. Although creating a surrogate type (hash or sequence) for the primary key column is not mandatory, it is common practice in Data Vault. Our example employs Amazon Redshift’s support for various cryptographic hash functions like MD5, FNV, SHA1, and SHA2 to generate the primary key column.

create table raw_data_vault.hub_sport_event 
(
  sport_event_pk  varchar(32) not null     
 ,sport_event_id  integer     not null
 ,load_dts        timestamp   not null       
 ,record_source   varchar(10) not null      
);

Note the following:

  • The preceding code assumes the MD5 hashing algorithm is used. If FNV_HASH is employed, the data type of the _PK column will be BIGINT.
  • The Id column represents the business key from the source feed, which is input into the hashing function for the _PK column.
  • In this example, only a single value is used for the business key. If a compound key is necessary, multiple columns can be added.
  • Load_DTS is populated through the staging schema or extract, transform, and load (ETL) process.
  • Record_Source is populated via the staging schema or ETL code.
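The load pattern described in these notes can be sketched as an insert-select from staging into the hub. The staging table name and its columns below are assumptions for illustration; the important points are that the business key is passed through the hash function to produce the _PK, and that only keys not already present in the hub are inserted, which keeps the hub a distinct list of business keys:

```sql
-- Sketch of a hub load, assuming a staging table staging.sport_event
-- with columns id, load_dts, and record_source (names are illustrative)
insert into raw_data_vault.hub_sport_event
select md5(stg.id::varchar)  -- hash of the business key becomes the _PK
      ,stg.id                -- the business key itself
      ,stg.load_dts          -- when the key was first discovered
      ,stg.record_source     -- source system that delivered the key
from staging.sport_event stg
where not exists (           -- insert only business keys new to the hub
  select 1
  from raw_data_vault.hub_sport_event hub
  where hub.sport_event_id = stg.id
);
```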

Link Entities

The link object represents the occurrence of two or more business keys involved in a transaction, such as purchasing a ticket for a sporting event. Each business key is mastered within its respective hub, and a primary key is generated for the link by combining all business keys (often separated by a delimiter like ‘^’). Similar to hubs, additional columns are typically included in links, such as Load DateTimeStamp and Record Source.

create table raw_data_vault.lnk_ticket_sport_event 
(
  ticket_sport_event_pk varchar(32)  not null    
 ,ticket_fk             varchar(32)  not null   
 ,sport_event_fk        varchar(32)  not null   
 ,load_dts              timestamp    not null   
 ,record_source         varchar(10)  not null   
);
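The link load follows the same insert-select pattern as the hub. A sketch is shown below; the staging table name and columns are assumptions. The link primary key hashes all participating business keys joined by the '^' delimiter mentioned earlier, while each business key is also hashed individually to produce the foreign keys pointing back to the respective hubs:

```sql
-- Sketch of a link load; staging table and column names are illustrative
insert into raw_data_vault.lnk_ticket_sport_event
select md5(stg.ticket_id::varchar || '^' || stg.sport_event_id::varchar)
      ,md5(stg.ticket_id::varchar)       -- FK to the ticket hub
      ,md5(stg.sport_event_id::varchar)  -- FK to the sporting event hub
      ,stg.load_dts
      ,stg.record_source
from staging.ticket_purchase_history stg
where not exists (  -- insert only key combinations new to the link
  select 1
  from raw_data_vault.lnk_ticket_sport_event lnk
  where lnk.ticket_sport_event_pk =
        md5(stg.ticket_id::varchar || '^' || stg.sport_event_id::varchar)
);
```

Hashing each key with the same function used for the hubs ensures the link's foreign keys join directly to the hub primary keys without any lookup during the load, which is what allows hubs and links to be loaded in parallel.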


